 journal article


Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?

arXiv.org Artificial Intelligence

Assessing published academic journal articles is a common task for evaluations of departments and individuals. Whilst it is sometimes supported by citation data, Large Language Models (LLMs) may give more useful indications of article quality. Evidence of this capability exists for two of the largest LLM families, ChatGPT and Gemini, and the medium-sized LLM Gemma3 27b, but it is unclear whether smaller LLMs and reasoning models have similar abilities. This is important because larger models may be slow and impractical in some situations, and reasoning models may perform differently. Four relevant questions are addressed with Gemma3 variants, Llama4 Scout, Qwen3, Magistral Small and DeepSeek R1, on a dataset of 2,780 medical, health and life science papers in 6 fields, with two different gold standards, one novel. The results suggest that smaller (open weights) and reasoning LLMs have similar performance to ChatGPT 4o-mini and Gemini 2.0 Flash, but that 1b parameters may often, and 4b sometimes, be too few. Moreover, averaging scores from multiple identical queries seems to be a universally successful strategy, and few-shot prompts (four examples) tended to help, but the evidence was equivocal. Reasoning models did not have a clear advantage. Overall, the results show, for the first time, that smaller LLMs >4b, including reasoning models, have a substantial capability to score journal articles for research quality, especially if score averaging is used.


ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

arXiv.org Artificial Intelligence

Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.


Can Smaller Large Language Models Evaluate Research Quality?

arXiv.org Artificial Intelligence

Research evaluation is a common and important task for academics and managers, and it is often supported by citation-based indicators (Hicks et al., 2015; Moed, 2005; Mukherjee, 2022). With the increasingly widespread use of Artificial Intelligence (AI) in research (Mohammadi et al., 2025), it is important to check whether it can save expert time through support of the research evaluation task. ChatGPT research quality score estimates for journal articles are recent alternatives to citations as quantitative indicators to support evaluations (Kousha & Thelwall, 2025). Their value lies in their positive correlation with expert judgement in all or nearly all fields, and at a slightly higher rate than for citation-based indicators (Thelwall, 2025abc). Despite some systematic biases or disparities (Thelwall & Kurt, 2025), this property means that they are helpful when expert judgement fails, such as for areas outside of the assessor's expertise, as a cross-check for bias, and for evaluations where assessment expertise is unavailable or too expensive for the value of the task (Thelwall, 2025d). Whilst a positive correlation with expert judgement has been established for three of the largest Large Language Models (LLMs) in 2025, ChatGPT 4o, ChatGPT 4o-mini, and Google Gemini Flash 1.5 (Thelwall, 2025ac), these are all cloud-based services and may be too expensive or not private enough for some research evaluation purposes (Nowak et al., 2025). Moreover, cloud-based services can be withdrawn, updated, or made more costly, so research evaluation procedures may not be able to rely on them. Thus, there is a need to test whether any smaller "open weights" LLMs (Sowe et al., 2024) that can be downloaded and used offline have a capability to estimate research quality.


Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers

arXiv.org Artificial Intelligence

Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method, in which an LLM generates comprehension-based questions from the original text and then answers them based on the translated text. Our benchmark results show an average performance of 95.9%, showing that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages, finding that the authors consistently found the translations to accurately capture the original information in their articles. Interestingly, a third of the authors found many technical terms "overtranslated," expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation. The code and translated articles are available at https://hankleid.github.io/ProjectMundo.
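The question-and-answer benchmark described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the exact-match grading rule, and the toy stand-ins for the LLM calls are all assumptions, since the abstract does not say how answers are graded.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]  # (question, reference answer derived from the original text)


def qa_benchmark_score(
    original: str,
    translated: str,
    generate_questions: Callable[[str], List[QAPair]],
    answer_question: Callable[[str, str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of questions (written from the original) answered correctly from the translation."""
    qa_pairs = generate_questions(original)
    if not qa_pairs:
        raise ValueError("no questions were generated")
    correct = sum(
        1 for question, reference in qa_pairs
        if is_correct(answer_question(question, translated), reference)
    )
    return correct / len(qa_pairs)


# Hypothetical stand-ins for the two LLM calls, so the sketch runs offline.
def toy_generate_questions(text: str) -> List[QAPair]:
    return [
        ("What temperature was used?", "300 K"),
        ("How many samples were measured?", "12"),
    ]


def toy_answer(question: str, translated_text: str) -> str:
    # Answers only if the relevant value survived in the translated text.
    facts = {"temperature": "300 K", "samples": "12"}
    for keyword, value in facts.items():
        if keyword in question and value in translated_text:
            return value
    return "unknown"


def exact_match(answer: str, reference: str) -> bool:
    # Assumed grading rule: case-insensitive exact match.
    return answer.strip().lower() == reference.strip().lower()


score = qa_benchmark_score(
    "We measured 12 samples at 300 K.",
    "Se midieron 12 muestras a temperatura ambiente.",  # the temperature value was lost
    toy_generate_questions,
    toy_answer,
    exact_match,
)
print(score)
```

In a real pipeline the two toy functions would be replaced by model calls, and the grading rule would likely need to be more tolerant than exact matching.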


Evaluating the quality of published medical research with ChatGPT

arXiv.org Artificial Intelligence

Research quality evaluation is important for departmental evaluations and academic career decisions. Unfortunately, the evaluators may not have time to fully read the work assessed and may instead rely on the reputation or Journal Impact Factor of the publishing journals, on the citation counts for individual articles, or on the reputation or career citations of the author. Whilst journal-based evidence is not optimal (Waltman & Traag, 2021), the main article-level indicator, citation counts, only directly reflects the scholarly impact of work and not its rigour, originality, and societal impacts (Aksnes, et al., 2019), all of which are relevant quality dimensions (Langfeldt et al., 2020). Moreover, article citation counts are ineffective for newer articles (Wang, 2013). In response, attempts to use Large Language Models (LLMs) to evaluate the quality of academic work have shown that ChatGPT quality scores are at least as effective as citation counts in most fields and substantially better in a few (Thelwall & Yaghi, 2024). Medicine is an exception, however, with ChatGPT research quality scores having a small negative correlation with the mean scores of the submitting department in the Research Excellence Framework (REF) Clinical Medicine Unit of Assessment (UoA) (Thelwall, 2024ab; Thelwall & Yaghi, 2024).


Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs

arXiv.org Artificial Intelligence

Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.


Topics in the Study of the Pragmatic Functions of Phonetic Reduction in Dialog

arXiv.org Artificial Intelligence

Feeling that our inventory of prosodic features was incomplete, we set out to add phonetic reduction to the features handled by the Midlevel Prosodic Features Toolkit (Ward 2023). We failed in this goal, but in the process learned a lot about reduction. The headline finding was that phonetic reduction correlates with positive assessments in American English; that result, plus closely related topics, is reported in a journal article submission (Ward et al. 2024). However, not everything that we learned fit there, so this document reports the rest. Some of the discussions are stand-alone -- notably those of spectral tilt, annotation for reduction, and prosodic correlates of reduction, as found in Sections 4-5 -- but most readers will want to start with the journal article and use this document only for details and leftovers.


Supervised machine learning for microbiomics: bridging the gap between current and best practices

arXiv.org Artificial Intelligence

Machine learning (ML) is set to accelerate innovations in clinical microbiomics, such as in disease diagnostics and prognostics. This will require high-quality, reproducible, interpretable workflows whose predictive capabilities meet or exceed the high thresholds set for clinical tools by regulatory agencies. Here, we capture a snapshot of current practices in the application of supervised ML to microbiomics data, through an in-depth analysis of 100 peer-reviewed journal articles published in 2021-2022. We apply a data-driven approach to steer discussion of the merits of varied approaches to experimental design, including key considerations such as how to mitigate the effects of small dataset size while avoiding data leakage. We further provide guidance on how to avoid common experimental design pitfalls that can hurt model performance, trustworthiness, and reproducibility. Discussion is accompanied by an interactive online tutorial that demonstrates foundational principles of ML experimental design, tailored to the microbiomics community. Formalizing community best practices for supervised ML in microbiomics is an important step towards improving the success and efficiency of clinical research, to the benefit of patients and other stakeholders.


Can ChatGPT evaluate research quality?

arXiv.org Artificial Intelligence

Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task. Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements. Findings: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author's significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations. Research limitations: The data is self-evaluations of a convenience sample of articles from one academic in one field. Practical implications: Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use. Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.


JCoLA: Japanese Corpus of Linguistic Acceptability

arXiv.org Artificial Intelligence

Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, so that various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86 %; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14 %; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.